A.K.A. Titanic Attendance Sheet

Overview

This document outlines the process of creating a Random Forest predictive model in R. Our model will attempt to predict whether or not a passenger on the Titanic would have survived based on the information they provided before boarding.

To construct our model, we have already installed the following packages: rpart, rattle, caret, ROCR, and randomForest. We have also cleaned our data so that there are no missing values.
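The setup described above can be sketched as follows. Note that the file name `titanic.csv` and the object name `data` are assumptions for illustration; substitute the path to your own cleaned dataset.

```r
# Load the packages used throughout this document
library(rpart)
library(rattle)
library(caret)
library(ROCR)
library(randomForest)

# Read the cleaned Titanic data; the file name here is a placeholder
data <- read.csv("titanic.csv", stringsAsFactors = FALSE)
```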

Partition the Data

When using decision trees or random forest modeling, it is important to divide your data into a training dataset and a test dataset. The training dataset is what we will use to construct our model. The testing dataset allows us to see how effectively our model can predict the results of unseen data.

# Split the data into two sets
# 50% of the sample size
    smp_size <- floor(0.5 * nrow(data))

# Set the seed to make your partition reproducible
    set.seed(100)
    trainindex <- sample(seq_len(nrow(data)), size = smp_size)
    
    train <- data[trainindex, ]
    test <- data[-trainindex, ]

Create a Random Forest Model

A random forest model can safely include all available variables. Rather than requiring the user to decide in advance which variables matter, the random forest will automatically rank the most influential variables.

This model will examine whether or not someone survived based on their social class, sex, age, number of siblings and spouse on board, number of parents and children on board, ticket price, and embarking location.

The model will use the train dataset to create 1,000 decision trees. The random forest will aggregate the votes of these decision trees into a categorizing process for predicted survival.

# Random Forest
    fit <- randomForest(as.factor(Survived) ~ Pclass + Sex + Age + SibSp +
                            Parch + Fare + Embarked,
                        data = train, importance = TRUE, ntree = 1000)

Below is the plot of our random forest. We can see how our error rate decreases and plateaus as we examine more decision trees.

    plot(fit)

Next, we can review which variables had the most predictive value for assessing whether a passenger survived the Titanic.

# Analyze importance of explanatory variables
    importance(fit)
##                  0         1 MeanDecreaseAccuracy MeanDecreaseGini
## Pclass   12.895497 34.989554             36.86166        15.260668
## Sex      70.551059 88.144675             96.18286        50.186074
## Age      16.116902 19.090688             24.59926        32.970913
## SibSp    17.436767 -6.798807             12.32441         8.140573
## Parch    18.456035 10.265035             21.74832         9.378461
## Fare     14.981120 26.043025             30.64243        36.720307
## Embarked  7.384013 14.373400             16.54230         6.271409
    varImpPlot(fit)

Testing the Random Forest

Here we will use the categorizing process of the random forest, built from the train dataset, to predict who survived in the test dataset.

First, we must create a data frame of predictions for the test data, based on the random forest model.

    pred_data <- data.frame(predict(fit, test, type = "class"))

Second, we can conduct a simple misclassification test. This tells us the overall accuracy of our model: the proportion of passengers for whom the predicted value matches the actual survival status.

      misclassificationError <- mean(pred_data != test$Survived)
      print(paste('Accuracy',1-misclassificationError))
## [1] "Accuracy 0.834080717488789"
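As a self-contained illustration of the arithmetic behind this test, consider the toy vectors below (hypothetical labels, not the Titanic data): the error is simply the share of positions where prediction and truth disagree, and accuracy is its complement.

```r
# Toy predicted and actual labels (hypothetical values for illustration)
predicted <- c(0, 1, 1, 0, 1)
actual    <- c(0, 1, 0, 0, 1)

# Misclassification error is the mean of the mismatch indicator:
# 1 mismatch out of 5 observations = 0.2
err <- mean(predicted != actual)
acc <- 1 - err  # 0.8
```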

Third, we will construct a confusion matrix to analyze how our correct predictions break down into true positives and true negatives. A confusion matrix also allows us to see how often our model produces false positives and false negatives.

      confusionmat <- confusionMatrix(pred_data[[1]], as.factor(test$Survived))
      confusionmat
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 258  51
##          1  23 114
##                                           
##                Accuracy : 0.8341          
##                  95% CI : (0.7962, 0.8674)
##     No Information Rate : 0.63            
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6312          
##  Mcnemar's Test P-Value : 0.001697        
##                                           
##             Sensitivity : 0.9181          
##             Specificity : 0.6909          
##          Pos Pred Value : 0.8350          
##          Neg Pred Value : 0.8321          
##              Prevalence : 0.6300          
##          Detection Rate : 0.5785          
##    Detection Prevalence : 0.6928          
##       Balanced Accuracy : 0.8045          
##                                           
##        'Positive' Class : 0               
## 

Fourth (and finally), a ROC curve lets us review the trade-off between true-positive and false-positive rates as we shift the threshold for predicting survival. A better model has a larger area under the curve (AUC) as we trace out these trade-offs.

      predroc <- data.frame(predict(fit, test, type = "prob"))
      pr <- prediction(predroc[2], test$Survived)
      prf <- performance(pr, measure = "tpr", x.measure = "fpr")
      plot(prf)

      auc <- performance(pr, measure = "auc")
      auc <- auc@y.values[[1]]
      auc
## [1] 0.8718861
We Made It!!